If you took my MLSP class, you may think that you’ve seen this problem. But, it’s actually somewhat different from what you did before, so read carefully. And, this time you SHOULD implement a DNN with at least two hidden layers. So, don’t reuse your legacy MATLAB code for this problem.
When you attended IUB, you took a course taught by Prof. K. Since you really liked his lectures, you decided to record them without the professor’s permission. You felt awkward, but you did it anyway because you really wanted to review his lectures later.
Although you meant to review the lecture every time, it turned out that you never listened to it. After graduation, you realized that a lot of concepts you face at work were actually covered by Prof. K’s class. So, you decided to revisit the lectures and study the materials once again using the recordings.
You should have reviewed your recordings earlier. It turned out that a fellow student who used to sit next to you always ate chips in the middle of the class right beside your microphone. So, Prof. K’s beautiful deep voice was contaminated by the annoying chip eating noise.
But, you vaguely recall that you learned some things about speech denoising and source separation from Prof. K’s class. So, you decided to build a simple deep learning-based speech denoiser that takes a noisy speech spectrum (speech plus chip eating noise) and then produces a cleaned-up speech spectrum.
Since you don’t have Prof. K’s clean speech signal, I prepared this male speech data recorded by other people. train_dirty_male.wav and train_clean_male.wav are the noisy speech and its corresponding clean speech you are going to use for training the network. Take a listen to them. Load them and convert them into spectrograms, which are the matrix representation of signals.
import numpy as np
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import imshow
from tensorflow import keras
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras import layers
import librosa
import IPython.display as ipd
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
#!pip install librosa # in colab, you’ll need to install this
import librosa
s, sr=librosa.load('/content/drive/MyDrive/Deep Learning Assignment/train_clean_male.wav', sr=None)
S=librosa.stft(s, n_fft=1024, hop_length=512)
sn, sr=librosa.load('/content/drive/MyDrive/Deep Learning Assignment/train_dirty_male.wav', sr=None)
X=librosa.stft(sn, n_fft=1024, hop_length=512)
S_abs = np.abs(S).T
X_abs = np.abs(X).T
#define the keras model
model = Sequential()
He_initializer = tf.keras.initializers.HeNormal()
model.add(Dense(512, input_dim = 513, activation='relu',kernel_initializer=He_initializer))
model.add(Dense(512, activation='relu', kernel_initializer=He_initializer))
model.add(Dense(512, activation='relu', kernel_initializer=He_initializer))
model.add(Dense(513, activation='relu', kernel_initializer=He_initializer))
model.compile(optimizer='adam',loss='mse')
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_4 (Dense)              (None, 512)               263168
_________________________________________________________________
dense_5 (Dense)              (None, 512)               262656
_________________________________________________________________
dense_6 (Dense)              (None, 512)               262656
_________________________________________________________________
dense_7 (Dense)              (None, 513)               263169
=================================================================
Total params: 1,051,649
Trainable params: 1,051,649
Non-trainable params: 0
_________________________________________________________________
model.fit(x = X_abs, y = S_abs,epochs=100, batch_size=10,shuffle=True,verbose=2)
Epoch 1/100 246/246 - 1s - loss: 0.0076 Epoch 2/100 246/246 - 1s - loss: 0.0080 Epoch 3/100 246/246 - 1s - loss: 0.0065 Epoch 4/100 246/246 - 1s - loss: 0.0067 Epoch 5/100 246/246 - 1s - loss: 0.0065 Epoch 6/100 246/246 - 1s - loss: 0.0074 Epoch 7/100 246/246 - 1s - loss: 0.0072 Epoch 8/100 246/246 - 1s - loss: 0.0058 Epoch 9/100 246/246 - 1s - loss: 0.0057 Epoch 10/100 246/246 - 1s - loss: 0.0062 Epoch 11/100 246/246 - 1s - loss: 0.0064 Epoch 12/100 246/246 - 1s - loss: 0.0057 Epoch 13/100 246/246 - 1s - loss: 0.0056 Epoch 14/100 246/246 - 1s - loss: 0.0057 Epoch 15/100 246/246 - 1s - loss: 0.0054 Epoch 16/100 246/246 - 1s - loss: 0.0055 Epoch 17/100 246/246 - 1s - loss: 0.0052 Epoch 18/100 246/246 - 1s - loss: 0.0053 Epoch 19/100 246/246 - 1s - loss: 0.0061 Epoch 20/100 246/246 - 1s - loss: 0.0054 Epoch 21/100 246/246 - 1s - loss: 0.0057 Epoch 22/100 246/246 - 1s - loss: 0.0050 Epoch 23/100 246/246 - 1s - loss: 0.0050 Epoch 24/100 246/246 - 1s - loss: 0.0050 Epoch 25/100 246/246 - 1s - loss: 0.0047 Epoch 26/100 246/246 - 1s - loss: 0.0046 Epoch 27/100 246/246 - 1s - loss: 0.0045 Epoch 28/100 246/246 - 1s - loss: 0.0049 Epoch 29/100 246/246 - 1s - loss: 0.0049 Epoch 30/100 246/246 - 1s - loss: 0.0052 Epoch 31/100 246/246 - 1s - loss: 0.0053 Epoch 32/100 246/246 - 1s - loss: 0.0061 Epoch 33/100 246/246 - 1s - loss: 0.0051 Epoch 34/100 246/246 - 1s - loss: 0.0042 Epoch 35/100 246/246 - 1s - loss: 0.0043 Epoch 36/100 246/246 - 1s - loss: 0.0048 Epoch 37/100 246/246 - 1s - loss: 0.0043 Epoch 38/100 246/246 - 1s - loss: 0.0043 Epoch 39/100 246/246 - 1s - loss: 0.0045 Epoch 40/100 246/246 - 1s - loss: 0.0042 Epoch 41/100 246/246 - 1s - loss: 0.0040 Epoch 42/100 246/246 - 1s - loss: 0.0040 Epoch 43/100 246/246 - 1s - loss: 0.0039 Epoch 44/100 246/246 - 1s - loss: 0.0043 Epoch 45/100 246/246 - 1s - loss: 0.0045 Epoch 46/100 246/246 - 1s - loss: 0.0041 Epoch 47/100 246/246 - 1s - loss: 0.0044 Epoch 48/100 246/246 - 1s - loss: 0.0051 Epoch 49/100 246/246 - 1s - loss: 0.0044 
Epoch 50/100 246/246 - 1s - loss: 0.0044 Epoch 51/100 246/246 - 1s - loss: 0.0045 Epoch 52/100 246/246 - 1s - loss: 0.0040 Epoch 53/100 246/246 - 1s - loss: 0.0033 Epoch 54/100 246/246 - 1s - loss: 0.0031 Epoch 55/100 246/246 - 1s - loss: 0.0032 Epoch 56/100 246/246 - 1s - loss: 0.0035 Epoch 57/100 246/246 - 1s - loss: 0.0036 Epoch 58/100 246/246 - 1s - loss: 0.0038 Epoch 59/100 246/246 - 1s - loss: 0.0036 Epoch 60/100 246/246 - 1s - loss: 0.0037 Epoch 61/100 246/246 - 1s - loss: 0.0035 Epoch 62/100 246/246 - 1s - loss: 0.0035 Epoch 63/100 246/246 - 1s - loss: 0.0041 Epoch 64/100 246/246 - 1s - loss: 0.0050 Epoch 65/100 246/246 - 1s - loss: 0.0050 Epoch 66/100 246/246 - 1s - loss: 0.0045 Epoch 67/100 246/246 - 1s - loss: 0.0038 Epoch 68/100 246/246 - 1s - loss: 0.0035 Epoch 69/100 246/246 - 1s - loss: 0.0031 Epoch 70/100 246/246 - 1s - loss: 0.0031 Epoch 71/100 246/246 - 1s - loss: 0.0031 Epoch 72/100 246/246 - 1s - loss: 0.0030 Epoch 73/100 246/246 - 1s - loss: 0.0032 Epoch 74/100 246/246 - 1s - loss: 0.0033 Epoch 75/100 246/246 - 1s - loss: 0.0036 Epoch 76/100 246/246 - 1s - loss: 0.0035 Epoch 77/100 246/246 - 1s - loss: 0.0037 Epoch 78/100 246/246 - 1s - loss: 0.0038 Epoch 79/100 246/246 - 1s - loss: 0.0035 Epoch 80/100 246/246 - 1s - loss: 0.0041 Epoch 81/100 246/246 - 1s - loss: 0.0036 Epoch 82/100 246/246 - 1s - loss: 0.0029 Epoch 83/100 246/246 - 1s - loss: 0.0032 Epoch 84/100 246/246 - 1s - loss: 0.0030 Epoch 85/100 246/246 - 1s - loss: 0.0030 Epoch 86/100 246/246 - 1s - loss: 0.0033 Epoch 87/100 246/246 - 1s - loss: 0.0037 Epoch 88/100 246/246 - 1s - loss: 0.0038 Epoch 89/100 246/246 - 1s - loss: 0.0044 Epoch 90/100 246/246 - 1s - loss: 0.0032 Epoch 91/100 246/246 - 1s - loss: 0.0027 Epoch 92/100 246/246 - 1s - loss: 0.0025 Epoch 93/100 246/246 - 1s - loss: 0.0025 Epoch 94/100 246/246 - 1s - loss: 0.0027 Epoch 95/100 246/246 - 1s - loss: 0.0030 Epoch 96/100 246/246 - 1s - loss: 0.0036 Epoch 97/100 246/246 - 1s - loss: 0.0039 Epoch 98/100 246/246 - 1s - 
loss: 0.0040 Epoch 99/100 246/246 - 1s - loss: 0.0036 Epoch 100/100 246/246 - 1s - loss: 0.0036
<keras.callbacks.History at 0x7f3b21878910>
$ \widehat{S} = \frac{X_{test}}{|X_{test}|}\odot \widehat{|S_{test}|} $
which means you take the phase information of the input noisy signal, $\frac{X_{test}}{|X_{test}|}$, and use it to recover the clean speech. $\odot$ stands for the Hadamard product, and the division is element-wise, too.
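As a sketch, the reconstruction formula above can be wrapped in a small helper. The function name and the epsilon guard against zero magnitudes are my own additions, not part of the assignment:

```python
import numpy as np

def reconstruct(X_test, S_mag_pred, eps=1e-8):
    """Combine the noisy signal's phase with the predicted clean magnitude.

    X_test:     complex STFT of the noisy test signal, shape (F, T)
    S_mag_pred: predicted clean magnitude spectrogram, shape (F, T)
    """
    phase = X_test / (np.abs(X_test) + eps)  # unit-magnitude complex phase
    return phase * S_mag_pred                # Hadamard product

# toy check: if the predicted magnitude equals |X|, we recover X (up to eps)
X = np.array([[3 + 4j, 1j], [2 + 0j, -1 - 1j]])
S_hat = reconstruct(X, np.abs(X))
print(np.allclose(S_hat, X, atol=1e-6))  # True
```

Feeding the result to `librosa.istft` then gives the time-domain estimate, as the cells below do.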
test_x_01, sr=librosa.load('/content/drive/MyDrive/Deep Learning Assignment/test_x_01.wav', sr=None)
test_x_01_stft =librosa.stft(test_x_01, n_fft=1024, hop_length=512)
test_x_01_abs = np.abs(test_x_01_stft).T
S_test_x_01 = model.predict(test_x_01_abs)
S_hat_test_x_01 = np.multiply(test_x_01_stft/test_x_01_abs.T,S_test_x_01.T)
S_istft = librosa.istft(S_hat_test_x_01, hop_length=512)
import IPython.display as ipd
ipd.Audio(S_istft, rate = sr)
import soundfile as sf
sf.write('test_s_01_recons.wav', S_istft, sr)
test_x_02, sr=librosa.load('/content/drive/MyDrive/Deep Learning Assignment/test_x_02.wav', sr=None)
test_x_02_stft =librosa.stft(test_x_02, n_fft=1024, hop_length=512)
test_x_02_abs = np.abs(test_x_02_stft).T
S_test_x_02 = model.predict(test_x_02_abs)
S_hat_test_x_02 = np.multiply(test_x_02_stft/test_x_02_abs.T,S_test_x_02.T)
S_istft = librosa.istft(S_hat_test_x_02, hop_length=512)
import soundfile as sf
sf.write('test_s_02_recons.wav', S_istft, sr)
import IPython.display as ipd
ipd.Audio(S_istft, rate = sr)
As an audio guy it’s sad to admit, but a lot of audio signal processing problems can be solved in the time-frequency domain, or an image version of the audio signal. You’ve learned how to do it in the previous homework by using STFT and its inverse process.
What that means is nothing stops you from applying a CNN to the same speech denoising problem. In this question, I’m asking you to implement a 1D CNN that does the speech denoising job in the STFT magnitude domain. 1D CNN here means a variant of CNN which does the convolution operation along only one of the axes. In our case it’s the frequency axis.
Like you did in homework 1 Q2, install/load librosa. Take the magnitude spectrograms of the dirty signal and the clean signal |X| and |S|.
Both in TensorFlow and PyTorch, you’d better transpose this matrix, so that each row of the matrix is a spectrum. Your 1D CNN will take one of these row vectors as an example, i.e. |X⊤:,i|. Since this is not an RGB image with three channels, nor will you use any other information than just the magnitude during training, your input image has only one channel (depth-wise). Coupled with your choice of the minibatch size, the dimensionality of your minibatch would be: [(batch size) × (number of channels) × (height) × (width)] = [B × 1 × 1 × 513]. Note that depending on the implementation of the 1D CNN layers in TF or PT, it’s okay to omit the height information. Carefully read the definition of the function you’ll use.
You’ll also need to define the size of the kernel, which will be 1 × D, or simply D depending on the implementation (because we know that there’s no convolution along the height axis).
If you define K kernels in the first layer, the output feature map’s dimension will be [B × K × 1 × (513 − D + 1)]. You don’t need too many kernels, but feel free to investigate. You don’t need too many hidden layers, either.
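The [B × K × 1 × (513 − D + 1)] claim follows from the standard "valid" (no zero-padding) convolution arithmetic. A tiny helper of my own to sanity-check the shapes, matching the Conv1D/MaxPooling1D sizes you will see in `model.summary()` below:

```python
# output width of a "valid" 1D convolution / pooling: W_out = (W_in - D) // stride + 1
def conv1d_out_width(w_in, kernel, stride=1):
    return (w_in - kernel) // stride + 1

# 513-bin spectrum, kernel size D = 3, stride 1
print(conv1d_out_width(513, 3))            # 511
# MaxPooling1D(pool_size=2) halves the width
print(conv1d_out_width(511, 2, stride=2))  # 255
print(conv1d_out_width(255, 3))            # 253
```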
In the end, you have to produce an output matrix of [B × 513], which approximates the clean magnitude spectra of the batch. That dimensionality is hard to match using CNN layers only, unless you take care of the edges by padding zeros (let’s not do zero-padding for this homework). Hence, you may want to flatten the last feature map into a vector, and add a regular linear layer to reduce that dimensionality down to 513.
s, sr=librosa.load('/content/drive/MyDrive/Deep Learning Assignment/train_clean_male.wav', sr=None)
S=librosa.stft(s, n_fft=1024, hop_length=512)
sn, sr=librosa.load('/content/drive/MyDrive/Deep Learning Assignment/train_dirty_male.wav', sr=None)
X=librosa.stft(sn, n_fft=1024, hop_length=512)
S_abs = np.abs(S).T
X_abs = np.abs(X).T
S_abs = S_abs.reshape(2459,513,1)
X_abs = X_abs.reshape(2459,513,1)
X_abs.shape,S_abs.shape
((2459, 513, 1), (2459, 513, 1))
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import MaxPooling1D
from tensorflow.keras.layers import Flatten
model = Sequential()
model.add(Conv1D(filters=64, kernel_size=3, activation='relu',input_shape=(513,1)))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=64, kernel_size=3, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(513, activation='relu'))
model.compile(optimizer='adam',loss='mse')
print(model.summary())
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv1d_6 (Conv1D)            (None, 511, 64)           256
_________________________________________________________________
max_pooling1d_6 (MaxPooling1 (None, 255, 64)           0
_________________________________________________________________
conv1d_7 (Conv1D)            (None, 253, 64)           12352
_________________________________________________________________
max_pooling1d_7 (MaxPooling1 (None, 126, 64)           0
_________________________________________________________________
flatten_2 (Flatten)          (None, 8064)              0
_________________________________________________________________
dense_8 (Dense)              (None, 1024)              8258560
_________________________________________________________________
dense_9 (Dense)              (None, 513)               525825
=================================================================
Total params: 8,796,993
Trainable params: 8,796,993
Non-trainable params: 0
_________________________________________________________________
None
model.fit(x = X_abs, y = S_abs,epochs=100, batch_size=128)
Epoch 1/100 20/20 [==============================] - 28s 40ms/step - loss: 0.0571 Epoch 2/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0260 Epoch 3/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0160 Epoch 4/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0120 Epoch 5/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0099 Epoch 6/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0085 Epoch 7/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0076 Epoch 8/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0068 Epoch 9/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0062 Epoch 10/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0058 Epoch 11/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0055 Epoch 12/100 20/20 [==============================] - 0s 16ms/step - loss: 0.0052 Epoch 13/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0049 Epoch 14/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0048 Epoch 15/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0051 Epoch 16/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0050 Epoch 17/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0045 Epoch 18/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0045 Epoch 19/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0043 Epoch 20/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0039 Epoch 21/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0037 Epoch 22/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0043 Epoch 23/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0041 Epoch 24/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0037 Epoch 25/100 20/20 
[==============================] - 0s 17ms/step - loss: 0.0033 Epoch 26/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0029 Epoch 27/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0027 Epoch 28/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0029 Epoch 29/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0028 Epoch 30/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0027 Epoch 31/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0031 Epoch 32/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0028 Epoch 33/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0025 Epoch 34/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0024 Epoch 35/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0022 Epoch 36/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0022 Epoch 37/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0021 Epoch 38/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0020 Epoch 39/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0021 Epoch 40/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0022 Epoch 41/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0020 Epoch 42/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0020 Epoch 43/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0019 Epoch 44/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0018 Epoch 45/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0017 Epoch 46/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0019 Epoch 47/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0025 Epoch 48/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0024 Epoch 49/100 20/20 
[==============================] - 0s 17ms/step - loss: 0.0022 Epoch 50/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0022 Epoch 51/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0020 Epoch 52/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0020 Epoch 53/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0018 Epoch 54/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0018 Epoch 55/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0017 Epoch 56/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0017 Epoch 57/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0015 Epoch 58/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0015 Epoch 59/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0015 Epoch 60/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0016 Epoch 61/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0017 Epoch 62/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0016 Epoch 63/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0017 Epoch 64/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0018 Epoch 65/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0016 Epoch 66/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0017 Epoch 67/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0015 Epoch 68/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0015 Epoch 69/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0014 Epoch 70/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0015 Epoch 71/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0014 Epoch 72/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0014 Epoch 73/100 20/20 
[==============================] - 0s 17ms/step - loss: 0.0015 Epoch 74/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0014 Epoch 75/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0014 Epoch 76/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0014 Epoch 77/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0012 Epoch 78/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0012 Epoch 79/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0012 Epoch 80/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0012 Epoch 81/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0020 Epoch 82/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0017 Epoch 83/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0014 Epoch 84/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0013 Epoch 85/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0011 Epoch 86/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0011 Epoch 87/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0016 Epoch 88/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0026 Epoch 89/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0021 Epoch 90/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0018 Epoch 91/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0015 Epoch 92/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0013 Epoch 93/100 20/20 [==============================] - 0s 18ms/step - loss: 0.0012 Epoch 94/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0013 Epoch 95/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0013 Epoch 96/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0015 Epoch 97/100 20/20 
[==============================] - 0s 17ms/step - loss: 0.0018 Epoch 98/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0015 Epoch 99/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0014 Epoch 100/100 20/20 [==============================] - 0s 17ms/step - loss: 0.0011
<keras.callbacks.History at 0x7f3b227a2350>
test_x_01, sr=librosa.load('/content/drive/MyDrive/Deep Learning Assignment/test_x_01.wav', sr=None)
test_x_01_stft =librosa.stft(test_x_01, n_fft=1024, hop_length=512)
test_x_01_abs = np.abs(test_x_01_stft).T
test_x_01_abs_reshape = test_x_01_abs.reshape(142,513,1)
S_test_x_01 = model.predict(test_x_01_abs_reshape)
## Checking the shapes
print("prediction shape", S_test_x_01.shape)
print("stft shape", test_x_01_stft.shape)
print("stft absolute shape", test_x_01_abs.shape)
prediction shape (142, 513) stft shape (513, 142) stft absolute shape (142, 513)
S_hat_test_x_01 = np.multiply(test_x_01_stft/test_x_01_abs.T,S_test_x_01.T)
S_istft = librosa.istft(S_hat_test_x_01, hop_length=512)
ipd.display(ipd.Audio(S_istft,rate=sr,embed=True))
import soundfile as sf
sf.write('test_s_01_recons_cnn1D.wav', S_istft, sr)
test_x_02, sr=librosa.load('/content/drive/MyDrive/Deep Learning Assignment/test_x_02.wav', sr=None)
test_x_02_stft =librosa.stft(test_x_02, n_fft=1024, hop_length=512)
test_x_02_abs = np.abs(test_x_02_stft).T
test_x_02_abs_reshape = test_x_02_abs.reshape(380,513,1)
S_test_x_02 = model.predict(test_x_02_abs_reshape)
## Checking the shapes
print("prediction shape", S_test_x_02.shape)
print("stft shape", test_x_02_stft.shape)
print("stft absolute shape", test_x_02_abs.shape)
prediction shape (380, 513) stft shape (513, 380) stft absolute shape (380, 513)
S_hat_test_x_02 = np.multiply(test_x_02_stft/test_x_02_abs.T,S_test_x_02.T)
S_istft = librosa.istft(S_hat_test_x_02, hop_length=512)
ipd.display(ipd.Audio(S_istft,rate=sr,embed=True))
import soundfile as sf
sf.write('test_s_02_recons_cnn1D.wav', S_istft, sr)
Now that we know the audio source separation problem can be solved in the image representation, nothing stops us from using 2D CNN for this.
To this end, let’s define our input “image” properly. You extract an image of 20 × 513 out of the entire STFT magnitude spectrogram (transposed). That’s an input sample. Using this, your 2D CNN estimates the cleaned-up spectrum that corresponds to the last (20th) input frame: |S⊤:,t+19| ≈ FCNN|X⊤:,t:t+19|, (3) i.e. the network takes the current frame and the 19 frames before it into account to predict the clean spectrum of the current frame, t + 19.
Your next image will be another 20 frames shifted by one frame: |S⊤:,t+20| ≈ FCNN|X⊤:,t+1:t+20|, (4) and so on. Therefore, a pair of adjacent images (unless you shuffle the order) will be with 19 overlapped frames. Since your original STFT spectrogram has 2,459 frames, you can create 2,440 such images as your training dataset.
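The overlapping-window construction above can be done in one shot with NumPy instead of an append loop. A sketch; `sliding_window_view` requires NumPy ≥ 1.20, and the random array here merely stands in for the real magnitude spectrogram:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# stand-in for the transposed magnitude spectrogram |X|^T: 2,459 frames x 513 bins
X_abs = np.random.rand(2459, 513).astype(np.float32)

# all overlapping 20-frame windows along the time axis: 2459 - 20 + 1 = 2440 images
windows = sliding_window_view(X_abs, window_shape=20, axis=0)   # (2440, 513, 20)
windows = windows.transpose(0, 2, 1)[..., np.newaxis]           # (2440, 20, 513, 1)
print(windows.shape)  # (2440, 20, 513, 1)
```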
Therefore the input to the 2D CNN will be of [(batch size) ×1 × 20 × 513].
Your 2D CNN kernel should, of course, have a size larger than 1 along both the width axis (frequencies) and the height axis (frames). Feel free to investigate different sizes.
Otherwise, the basic idea is similar to the 1D CNN case. You’ll still need the same techniques (convolution, pooling, flattening), as well as the FC layer.
Report the denoising results in the same way. One thing to note is that your output will contain only 2,440 spectra, lacking the first 19 frames. You can ignore those first few frames when you calculate the SNR of the training results. A better way is to augment your input X with 19 silent frames (some magnitude spectra with very small random numbers) at the beginning to match the dimension. I recommend the latter approach.
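Hedged sketch of the two utilities the paragraph above asks for; the helper names `snr_db` and `pad_silent_frames` are my own, not from the assignment:

```python
import numpy as np

# SNR in dB between the clean time-domain signal s and the reconstruction s_hat
def snr_db(s, s_hat):
    n = min(len(s), len(s_hat))            # istft output can be slightly shorter
    s, s_hat = s[:n], s_hat[:n]
    return 10 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))

# prepend 19 near-silent frames (tiny random magnitudes) to the transposed
# magnitude spectrogram, so the 2D CNN can predict all frames, not just 2,440
def pad_silent_frames(X_abs, n_frames=19, scale=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    pad = scale * rng.random((n_frames, X_abs.shape[1]))
    return np.vstack([pad, X_abs])
```

With the padding applied, the sliding-window construction yields 2,459 images, so the predictions align one-to-one with the original frames.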
import librosa
s, sr=librosa.load('/content/drive/MyDrive/Deep Learning Assignment/train_clean_male.wav', sr=None)
S=librosa.stft(s, n_fft=1024, hop_length=512)
sn, sr=librosa.load('/content/drive/MyDrive/Deep Learning Assignment/train_dirty_male.wav', sr=None)
X=librosa.stft(sn, n_fft=1024, hop_length=512)
X_abs=np.abs(X).T
S_abs=np.abs(S).T
window = 20
# stack all overlapping 20-frame windows: 2459 - 20 + 1 = 2440 training images
train_3d = np.stack([X_abs[l:l+window, :] for l in range(X_abs.shape[0] - window + 1)])
train_3d = train_3d.reshape((2440, 20, 513, 1))
y_train=S_abs[:2440,:]
y_train=y_train.reshape((2440,513,1))
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Input
model = Sequential()
model.add(Input(shape = (20,513,1)))
model.add(Conv2D(100, kernel_size=(7,7), activation='relu', strides = 1))
model.add(MaxPooling2D(pool_size=(2, 2), strides=2, padding="valid"))
model.add(Conv2D(50, kernel_size=(5,5), activation='relu',strides = 1))
model.add(MaxPooling2D(pool_size=(2, 2), strides=2, padding="valid"))
model.add(Flatten())
model.add(Dense(513, activation='relu',kernel_initializer = tf.keras.initializers.HeNormal()))
print(model.summary())
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),loss='mse')
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 14, 507, 100)      5000
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 7, 253, 100)       0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 3, 249, 50)        125050
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 1, 124, 50)        0
_________________________________________________________________
flatten (Flatten)            (None, 6200)              0
_________________________________________________________________
dense (Dense)                (None, 513)               3181113
=================================================================
Total params: 3,311,163
Trainable params: 3,311,163
Non-trainable params: 0
_________________________________________________________________
None
model.fit(x=train_3d,y=y_train,batch_size=32,epochs=100,shuffle=True,verbose=2)
Epoch 1/100 77/77 - 33s - loss: 0.0754 Epoch 2/100 77/77 - 3s - loss: 0.0581 Epoch 3/100 77/77 - 2s - loss: 0.0430 Epoch 4/100 77/77 - 2s - loss: 0.0350 Epoch 5/100 77/77 - 2s - loss: 0.0291 Epoch 6/100 77/77 - 3s - loss: 0.0252 Epoch 7/100 77/77 - 2s - loss: 0.0230 Epoch 8/100 77/77 - 3s - loss: 0.0213 Epoch 9/100 77/77 - 3s - loss: 0.0203 Epoch 10/100 77/77 - 2s - loss: 0.0187 Epoch 11/100 77/77 - 3s - loss: 0.0180 Epoch 12/100 77/77 - 3s - loss: 0.0177 Epoch 13/100 77/77 - 3s - loss: 0.0168 Epoch 14/100 77/77 - 3s - loss: 0.0152 Epoch 15/100 77/77 - 3s - loss: 0.0156 Epoch 16/100 77/77 - 3s - loss: 0.0150 Epoch 17/100 77/77 - 2s - loss: 0.0143 Epoch 18/100 77/77 - 2s - loss: 0.0134 Epoch 19/100 77/77 - 2s - loss: 0.0158 Epoch 20/100 77/77 - 2s - loss: 0.0140 Epoch 21/100 77/77 - 2s - loss: 0.0137 Epoch 22/100 77/77 - 3s - loss: 0.0136 Epoch 23/100 77/77 - 3s - loss: 0.0130 Epoch 24/100 77/77 - 2s - loss: 0.0125 Epoch 25/100 77/77 - 3s - loss: 0.0116 Epoch 26/100 77/77 - 3s - loss: 0.0132 Epoch 27/100 77/77 - 3s - loss: 0.0120 Epoch 28/100 77/77 - 3s - loss: 0.0120 Epoch 29/100 77/77 - 2s - loss: 0.0115 Epoch 30/100 77/77 - 3s - loss: 0.0108 Epoch 31/100 77/77 - 3s - loss: 0.0104 Epoch 32/100 77/77 - 3s - loss: 0.0120 Epoch 33/100 77/77 - 3s - loss: 0.0106 Epoch 34/100 77/77 - 3s - loss: 0.0122 Epoch 35/100 77/77 - 3s - loss: 0.0129 Epoch 36/100 77/77 - 2s - loss: 0.0107 Epoch 37/100 77/77 - 3s - loss: 0.0099 Epoch 38/100 77/77 - 3s - loss: 0.0096 Epoch 39/100 77/77 - 3s - loss: 0.0094 Epoch 40/100 77/77 - 3s - loss: 0.0100 Epoch 41/100 77/77 - 3s - loss: 0.0096 Epoch 42/100 77/77 - 3s - loss: 0.0092 Epoch 43/100 77/77 - 3s - loss: 0.0094 Epoch 44/100 77/77 - 3s - loss: 0.0094 Epoch 45/100 77/77 - 3s - loss: 0.0094 Epoch 46/100 77/77 - 3s - loss: 0.0101 Epoch 47/100 77/77 - 3s - loss: 0.0091 Epoch 48/100 77/77 - 3s - loss: 0.0088 Epoch 49/100 77/77 - 3s - loss: 0.0087 Epoch 50/100 77/77 - 3s - loss: 0.0089 Epoch 51/100 77/77 - 3s - loss: 0.0091 Epoch 52/100 77/77 
- 3s - loss: 0.0092 Epoch 53/100 77/77 - 3s - loss: 0.0089 Epoch 54/100 77/77 - 3s - loss: 0.0089 Epoch 55/100 77/77 - 3s - loss: 0.0088 Epoch 56/100 77/77 - 3s - loss: 0.0089 Epoch 57/100 77/77 - 3s - loss: 0.0094 Epoch 58/100 77/77 - 3s - loss: 0.0091 Epoch 59/100 77/77 - 3s - loss: 0.0091 Epoch 60/100 77/77 - 3s - loss: 0.0094 Epoch 61/100 77/77 - 3s - loss: 0.0092 Epoch 62/100 77/77 - 3s - loss: 0.0089 Epoch 63/100 77/77 - 2s - loss: 0.0086 Epoch 64/100 77/77 - 3s - loss: 0.0084 Epoch 65/100 77/77 - 2s - loss: 0.0083 Epoch 66/100 77/77 - 2s - loss: 0.0081 Epoch 67/100 77/77 - 3s - loss: 0.0083 Epoch 68/100 77/77 - 3s - loss: 0.0080 Epoch 69/100 77/77 - 3s - loss: 0.0080 Epoch 70/100 77/77 - 2s - loss: 0.0080 Epoch 71/100 77/77 - 3s - loss: 0.0079 Epoch 72/100 77/77 - 3s - loss: 0.0080 Epoch 73/100 77/77 - 3s - loss: 0.0081 Epoch 74/100 77/77 - 3s - loss: 0.0082 Epoch 75/100 77/77 - 3s - loss: 0.0080 Epoch 76/100 77/77 - 2s - loss: 0.0082 Epoch 77/100 77/77 - 2s - loss: 0.0085 Epoch 78/100 77/77 - 2s - loss: 0.0094 Epoch 79/100 77/77 - 3s - loss: 0.0091 Epoch 80/100 77/77 - 3s - loss: 0.0083 Epoch 81/100 77/77 - 3s - loss: 0.0077 Epoch 82/100 77/77 - 3s - loss: 0.0076 Epoch 83/100 77/77 - 3s - loss: 0.0080 Epoch 84/100 77/77 - 3s - loss: 0.0079 Epoch 85/100 77/77 - 2s - loss: 0.0078 Epoch 86/100 77/77 - 3s - loss: 0.0080 Epoch 87/100 77/77 - 3s - loss: 0.0079 Epoch 88/100 77/77 - 3s - loss: 0.0079 Epoch 89/100 77/77 - 3s - loss: 0.0077 Epoch 90/100 77/77 - 3s - loss: 0.0082 Epoch 91/100 77/77 - 3s - loss: 0.0083 Epoch 92/100 77/77 - 2s - loss: 0.0079 Epoch 93/100 77/77 - 3s - loss: 0.0076 Epoch 94/100 77/77 - 3s - loss: 0.0075 Epoch 95/100 77/77 - 2s - loss: 0.0074 Epoch 96/100 77/77 - 2s - loss: 0.0071 Epoch 97/100 77/77 - 3s - loss: 0.0071 Epoch 98/100 77/77 - 3s - loss: 0.0071 Epoch 99/100 77/77 - 3s - loss: 0.0071 Epoch 100/100 77/77 - 3s - loss: 0.0071
<keras.callbacks.History at 0x7f91c050b090>
predictions=model.predict(train_3d,batch_size=32,verbose=0)
X=X[:,:2440]
mag=(X/np.abs(X))
recovered_signal=mag*predictions.T
recovered_signal=librosa.istft(recovered_signal,hop_length=512)
ipd.Audio(recovered_signal, rate=sr)
test_x_01, sr=librosa.load('/content/drive/MyDrive/Deep Learning Assignment/test_x_01.wav', sr=None)
test_x_01_stft =librosa.stft(test_x_01, n_fft=1024, hop_length=512)
test_x_01_abs = np.abs(test_x_01_stft).T
window = 20
# stack overlapping 20-frame windows: 142 - 20 + 1 = 123 test images
x_test_1 = np.stack([test_x_01_abs[l:l+window, :] for l in range(test_x_01_abs.shape[0] - window + 1)])
x_test_1 = x_test_1.reshape((123, 20, 513, 1))
predictions=model.predict(x_test_1,batch_size=32,verbose=0)
test_x_01_stft=test_x_01_stft[:,:123]
mag=(test_x_01_stft/np.abs(test_x_01_stft))
recovered_signal=mag*predictions.T
recovered_signal=librosa.istft(recovered_signal,hop_length=512)
ipd.Audio(recovered_signal, rate=sr)
import soundfile as sf
sf.write('/content/drive/MyDrive/Deep Learning Assignment/test_s_01_recons_cnn2.wav', recovered_signal, sr)
test_x_02, sr=librosa.load('/content/drive/MyDrive/Deep Learning Assignment/test_x_02.wav', sr=None)
test_x_02_stft =librosa.stft(test_x_02, n_fft=1024, hop_length=512)
test_x_02_abs = np.abs(test_x_02_stft).T
window = 20
# stack overlapping 20-frame windows: 380 - 20 + 1 = 361 test images
x_test_2 = np.stack([test_x_02_abs[l:l+window, :] for l in range(test_x_02_abs.shape[0] - window + 1)])
x_test_2 = x_test_2.reshape((361, 20, 513, 1))
predictions=model.predict(x_test_2,batch_size=32,verbose=0)
test_x_02_stft=test_x_02_stft[:,:361]
mag=(test_x_02_stft/np.abs(test_x_02_stft))
recovered_signal=mag*predictions.T
recovered_signal=librosa.istft(recovered_signal,hop_length=512)
ipd.Audio(recovered_signal, rate=sr)
import soundfile as sf
sf.write('/content/drive/MyDrive/Deep Learning Assignment/test_s_02_recons_cnn2.wav', recovered_signal, sr)